Leakage is one of the most common failures in boiler operation, and an anomaly 
that persists without being resolved can lead to it. An algorithm has been 
developed to predict this failure, consisting of three general procedures: 
feature selection, hierarchical clustering, and naive Bayes classification. 
Hierarchical clustering converts unlabeled data into labeled data, and naive 
Bayes classification calculates the probability used to decide whether an 
anomaly has occurred. This research focused on the effect of the feature 
selection method on the leakage prediction result. Two feature selection 
methods, structural analysis and principal component analysis (PCA), were 
deployed separately and then compared. Structural analysis predicted the 
leakage 13 hours 40 minutes in advance, while PCA predicted it 25 hours in 
advance. However, PCA caused more false alarms than structural analysis, which 
triggered only five false alarms in the week before the leakage. Moreover, 
structural analysis offered better traceability than PCA for understanding the 
leakage occurrence. 
1. INTRODUCTION 

Process monitoring is an important aspect of industrial operation. It helps operators evaluate 
equipment and components. The information received from process monitoring can be used for tasks 
such as predicting failure [1]. However, several aspects need to be considered before predicting a 
component's failure, such as the parameters describing the failure, the characteristics of the 
failure itself, and the method used to characterize the degradation trend [2]. 

The object of this research is boiler failure prediction, specifically boiler leakage prediction. 
Leakage in a boiler usually happens in the superheater, where the process of extreme combustion takes 
place [3]-[6]. An extended anomalous trend in process monitoring, especially an unresolved anomaly, 
tends to cause failure of the related component or equipment if it is not well controlled [2], [7]-[9]. 
For that reason, superheater leakage can be predicted by detecting the early anomaly in the related area. 

A related study combined principal component analysis (PCA) with clustering to detect anomalies 
in a marine engine system, without comparison to the maintenance logs [10]. It used PCA to reduce 
24 variables to seven principal components. However, the use of PCA and the absence of maintenance 
logs made the clustered data difficult to interpret. For that reason, maintenance logs were fully 
considered in this work to help in understanding the failure process. 
This study aims to examine how different feature selection methods affect the leakage prediction 
result. Two methods were chosen: structural analysis and PCA. Combined with clustering and 
classification, these methods were used to detect the superheater anomaly. Hierarchical clustering 
and naive Bayes classification were used for clustering and classification, respectively. In previous 
research on the same case of boiler superheater leakage, these methods predicted the leakage 
13 hours 40 minutes in advance using three variables selected by structural analysis [11]. 

This research offers a clear view of how the choice between "black box" (PCA) and "grey box" 
(structural analysis) feature selection affects the ability of the method (hierarchical clustering 
and naive Bayes classification) to predict superheater leakage. Structural analysis depends on 
understanding the combustion process in a boiler to select process variables and determine how they 
contribute to a failure [5]. PCA, on the other hand, offers a more straightforward approach in which 
the operator can choose any variables without having to consider the principles of the boiler 
process [12]. Each feature selection method has its own strengths and weaknesses. In brief, this 
research provides a comparative study of structural analysis and PCA in selecting features for 
leakage prediction in the boiler superheater. The comparison covers how far in advance each method 
can predict the leakage, how many false alarms are triggered during the process, and how each method 
interprets the leakage. 


2. RESEARCH METHOD 

The flow diagram of the research method is shown in Figure 1. This section discusses the primary 
procedures in the flow diagram: structural analysis, PCA, hierarchical clustering, and naive Bayes 
classification. The first step was acquiring the dataset. The data were taken at similar boiler 
loads to increase the consistency of the measurements, because different loads shift the mean of 
regular operation [4]. The data then underwent two processes separately: structural analysis and 
PCA. In the first process, structural analysis selected some variables as inputs for clustering and 
classification. In the second process, PCA transformed the data into lower-dimensional data, which 
then served as the input for clustering and classification. Finally, the results of both processes 
were compared and analyzed. 


2.1. Data acquisition 

This research used primary data taken from real-time data logging in the hydroelectric power 
plant. Two datasets were used: one containing normal operating data (with a minor disturbance) and 
another containing a major leakage failure in the pipeline that caused the unit to be shut down. 
Each dataset has 1,008 samples taken at 10-minute intervals. 
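
The sampling figures above fix the time span of each dataset, which is useful when reading the later results in hours. The short sketch below works through that arithmetic. It assumes each sample covers one 10-minute interval, and the function names `span` and `samples_in` are illustrative only, not part of the original study.

```python
from datetime import timedelta

SAMPLES_PER_DATASET = 1008               # stated in the text
SAMPLE_INTERVAL = timedelta(minutes=10)  # stated in the text


def span(n_samples, interval=SAMPLE_INTERVAL):
    """Time covered by n_samples consecutive logging intervals."""
    return n_samples * interval


def samples_in(duration, interval=SAMPLE_INTERVAL):
    """Number of whole sampling intervals that fit in a duration."""
    return duration // interval


print(span(SAMPLES_PER_DATASET))  # 7 days, 0:00:00
```

Under this assumption each dataset covers one week of operation, so a statement about "the whole week" refers to the full length of a dataset.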


2.2. Structural analysis 

Structural analysis relies on understanding the boiler process to pick reliable variables for the 
subsequent steps. A brief illustration of the boiler can be seen in Figure 2, and the available 
measurements are listed in Table 1. Knowing how the leakage was triggered helps in choosing the 
variables that can detect the anomaly that initiates the leakage. Many boiler leakage cases were 
mainly caused by overheating of the piping, which disturbs the heat transfer to the steam and 
consequently ruptures the pipe [5], [13]. Overheating occurs in areas where the combustion process 
plays a major role, such as the superheater and reheater [14]. Based on the available measurement 
data and the physical phenomena within the process, three variables were selected from the list in 
Table 1: PSH (primary superheater) inlet temperature, PSH outlet temperature, and SSH (secondary 
superheater) inlet temperature. Figure 3 shows a brief diagram of how the three selected variables 
relate to the process components in the boiler. 


2.3. PCA 

Principal component analysis is a statistical method that reduces high-dimensional data to a 
lower dimension while keeping the essential variation and important information in it [15], [16]. 
Moreover, when PCA is used in process monitoring, prior knowledge of the system is unnecessary 
because PCA can handle a high degree of correlation among the variables. It can also reduce the 
computational cost by decreasing the data dimensions [17]. 

A related study used PCA to detect anomalies in a boiler during load changes and successfully 
detected abnormality during the load transition [18]. However, it could not give a clear physical 
or cause-effect explanation of what happened during the transition. This research used 12 boiler 
operative variables, which were transformed into a lower-dimensional dataset represented by 
principal components. The number of principal components was set to the number required to capture 
at least 90% of the variance of the actual data [12]. The 12 boiler operative variables used in this 
research are listed in Table 1. Six principal components are sufficient to capture at least 90% of 
the variance. Table 2 lists the principal components and their cumulative variance. 
Figure 1. Flow diagram of the research 

Figure 2. Position of the sensors in the boiler 


Table 1. List of operative variables in the boiler 
Variable 
PSH: inlet temperature 
PSH: outlet temperature 
SSH: inlet temperature 
Primary air heater: outlet pressure 
Primary air duct: pressure 
Primary air heater: inlet air temperature 
Primary air heater: flue gas temperature 
Riser to steam drum: water temperature 
Secondary air duct: pressure 
Secondary air heater: inlet air temperature 
Secondary air heater: inlet flue gas temperature 
Secondary air heater: outlet flue gas temperature 
Figure 3. Block diagram of the superheater 


Table 2. Cumulative variance of the principal components 

Principal component (PC)    Cumulative variance (%) 
PC 1                        46.5 
PC 2                        60.9 
PC 3                        71.6 
PC 4                        79.6 
PC 5                        86.9 
PC 6                        92.5 
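
The 90% rule can be checked directly against the values in Table 2. The sketch below is only an illustration of that selection step, not the implementation used in the study; the helper names `components_needed` and `individual_shares` are hypothetical.

```python
# Cumulative explained variance (%) per principal component, copied from Table 2.
CUMULATIVE_VARIANCE = [46.5, 60.9, 71.6, 79.6, 86.9, 92.5]


def components_needed(cumulative_pct, threshold_pct=90.0):
    """Smallest number of leading components whose cumulative variance
    reaches threshold_pct, or None if the threshold is never reached."""
    for count, value in enumerate(cumulative_pct, start=1):
        if value >= threshold_pct:
            return count
    return None


def individual_shares(cumulative_pct):
    """Recover each component's own share from the cumulative values."""
    shares, previous = [], 0.0
    for value in cumulative_pct:
        shares.append(round(value - previous, 1))
        previous = value
    return shares


print(components_needed(CUMULATIVE_VARIANCE))  # 6
```

With five components the cumulative variance is 86.9%, which falls short of the threshold, so the sixth component is the first to satisfy the rule.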
2.4. Hierarchical clustering 

After the input variables are obtained from both feature selection methods, the next step is 
clustering. This research used a hierarchical method for the clustering process. Hierarchical 
clustering is an unsupervised machine learning method that groups data into clusters with high 
similarity without requiring labeled data [19]. Euclidean distance was used to measure the 
similarity between data points in this research. 

Clustering has been used to detect boiler failure in several studies. These studies showed that 
clustering could detect a failure and even warn of it before it happened [20], [21]. Clustering 
therefore has the potential to detect anomalies in a boiler that are followed by leakage failure. 
Moreover, the more definitive the input variables (that is, the clearer the connection between the 
variables and the related failure), the more readily the user can interpret the component's 
condition [11]. 

During the clustering process, the number of clusters must be defined. A study that evaluated 
the performance of an exhaust fan from its vibration showed that the effective number of clusters 
was three, indicating normal, warning, and failure [7]. Even though the three-cluster rule seems 
convincing, the data must be tested by constructing a dendrogram to determine the number of 
clusters [11]. A dendrogram is a graph that represents the distances between data points and shows 
how close they are to one another. 
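
To make the clustering step concrete, the following is a minimal sketch of agglomerative hierarchical clustering with Euclidean distance. Several things here are assumptions rather than details from the study: the linkage rule is not stated in the text, so average linkage is used for illustration; the six three-variable readings are invented toy values; and the function names are hypothetical. A real analysis would more likely rely on an established library such as SciPy.

```python
import math


def average_linkage(a, b, points):
    """Mean Euclidean distance between all pairs drawn from clusters a and b."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns (labels, merge_heights). The heights are the distances at which
    merges happened, which is what the vertical axis of a dendrogram shows.
    """
    clusters = [[i] for i in range(len(points))]
    heights = []
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        heights.append(d)
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels, heights


# Invented readings (three temperatures per sample), for illustration only.
toy_points = [
    (400.0, 420.0, 410.0), (401.0, 421.0, 409.0), (399.0, 419.0, 411.0),
    (430.0, 445.0, 440.0), (431.0, 446.0, 441.0),
    (360.0, 380.0, 370.0),
]
toy_labels, toy_heights = agglomerative(toy_points, n_clusters=3)
```

A large jump between consecutive merge heights is the usual visual cue in a dendrogram for where to cut the tree, which is how the number of clusters is read off in practice.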


2.5. Naive Bayes classification 

Naive Bayes classification is a classification method based on Bayes' theorem. It has a wide 
range of applications, such as detecting brain tumors [22], predicting purchases [23], and text 
classification [24]. It has also been mentioned as a method for prognosis in industrial cases [2]. 
Naive Bayes classification can also use labeled data to infer the category of new data [25]. In 
this research, naive Bayes classification is combined with hierarchical clustering. Hierarchical 
clustering transforms the unlabeled data into labeled data to distinguish the normal data from the 
anomaly. Subsequently, naive Bayes classification calculates the probability of the data being 
categorized as normal or anomaly based on the labeled data from the clustering process. The 
calculation of the probability that data is normal can be seen in (1)-(3). 
$P(N \mid Y_{1,j}, Y_{2,j}, \dots, Y_{n,j})$ is the probability of the normal condition given the 
$j$-th row of testing data $Y$, which contains $n$ input variables. $S_{norm}$ is the number of 
normal samples, $S$ is the total number of samples, $\bar{X}_{i,N}$ is the mean of the $i$-th 
variable in the normal cluster of training data $X$, and $\sigma_{i,N}$ is the standard deviation of 
the $i$-th variable in the normal cluster of training data $X$. 

The next step is setting the minimum probability for data to be classified as normal. Equation (4) 
describes how to obtain this minimum normal probability, $P_{min}$: it is the largest 
normal-condition probability obtained from the anomaly training data $X'$. If 
$P(N \mid Y_{1,j}, Y_{2,j}, \dots, Y_{n,j}) > P_{min}$, the data is classified as normal. Conversely, 
if $P(N \mid Y_{1,j}, Y_{2,j}, \dots, Y_{n,j}) < P_{min}$, the data is classified as an anomaly. 

$$P_{min} = \max_{j} \left\{ P(N \mid X'_{1,j}, X'_{2,j}, \dots, X'_{n,j}) \right\} \qquad (4)$$
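
The thresholding rule can be sketched in a few lines. The code below assumes the usual Gaussian naive Bayes form, in which the normal-condition score is the prior $S_{norm}/S$ multiplied by a Gaussian likelihood for each variable; whether the study normalizes this score, and whether it uses the population or sample standard deviation, is not stated here, so both choices are assumptions. All function names and the toy readings are hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_normal_cluster(normal_rows, total_samples):
    """Prior, per-variable mean and standard deviation of the normal cluster."""
    columns = list(zip(*normal_rows))
    return {
        "prior": len(normal_rows) / total_samples,  # S_norm / S
        "means": [mean(c) for c in columns],
        "stds": [pstdev(c) for c in columns],       # population std (assumed)
    }


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def normal_score(row, model):
    """Prior times the product of per-variable Gaussian likelihoods."""
    score = model["prior"]
    for x, mu, sigma in zip(row, model["means"], model["stds"]):
        score *= gaussian_pdf(x, mu, sigma)
    return score


def minimum_normal_probability(anomaly_rows, model):
    """Threshold as in (4): largest normal score of any anomaly training row."""
    return max(normal_score(row, model) for row in anomaly_rows)


def classify(row, model, p_min):
    """Normal only if the score is strictly above the threshold."""
    return "normal" if normal_score(row, model) > p_min else "anomaly"


# Invented training rows, labeled as if by the clustering step.
normal_rows = [(395, 415, 405), (400, 420, 410), (405, 425, 415),
               (400, 422, 408), (398, 418, 412), (402, 420, 410)]
anomaly_rows = [(412, 431, 422), (388, 409, 399)]

model = fit_normal_cluster(normal_rows, len(normal_rows) + len(anomaly_rows))
p_min = minimum_normal_probability(anomaly_rows, model)
```

Taking the maximum over the anomaly rows means that no anomaly training sample can score above the threshold, so every one of them is classified as an anomaly. With many variables the product of densities becomes very small, and working with log-probabilities would be the more robust choice.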


3. RESULTS AND DISCUSSION 

This section presents the training data and then the testing results from the two feature 
selection methods, structural analysis and PCA. It compares the results in terms of 
interpretability, false alarm occurrence, and longevity of the prediction. The results differed 
considerably and exposed the strengths and weaknesses of each method. 
3.1. Data 

Figure 4(a) shows the training data used in structural analysis: PSH inlet temperature, PSH 
outlet temperature, and SSH inlet temperature. Figure 4(b), for principal components 1-3, and 
Figure 4(c), for principal components 4-6, show the standardized data of the 12 variables in 
Table 1 after PCA compressed them into six principal components. All data in Figure 4(a), 
Figure 4(b), and Figure 4(c) were taken in the same period in the same boiler unit. 

Figure 5 plots the testing data used in this research. Figure 5(a) shows the testing data for 
the same variables as in Figure 4(a). Figure 5(b), for principal components 1-3, and Figure 5(c), 
for principal components 4-6, use the same variables, standardized and transformed by PCA as in 
Figure 4(b) and Figure 4(c). The data in Figures 4 and 5 were taken in the same boiler unit but on 
different occasions. The testing data contains a significant leakage failure that occurred at the 
911th sample, which is marked by a big dot. 


[Figures 4 and 5: three panels each, plotted against the sampling index (0 to 1,000). Panel (a): 
"Temperature Measurements in Superheater" (PSH_inlet, PSH_outlet, SSH_inlet). Panel (b): "The 6 
Principal Components (1-3) in Boiler's Measurement". Panel (c): "The 6 Principal Components (4-6) 
in Boiler's Measurement". Figure 4 shows the training data and Figure 5 the testing data.] 

Figure 4. Data plotting of (a) training data using actual measurement for structural analysis 
(3 variables); (b) training standardized data using PCA (6 principal components) from principal 
component 1-3; and (c) training standardized data using PCA (6 principal components) from principal 
component 4-6 

Figure 5. Data plotting of (a) testing data using actual measurement for structural analysis 
(3 variables); (b) testing standardized data using PCA (6 principal components) from principal 
component 1-3; and (c) testing standardized data using PCA (6 principal components) from principal 
component 4-6. The major leakage failure is marked by dots in each variable (911th sample) 
3.2. Interpretability 

Interpretable machine learning has become a topic of growing interest. It gives valuable 
information and helps the user in decision-making. However, to understand a machine learning 
result, users should understand the basic concepts of the object under study and compare the 
machine learning observations with the real events to gain real insight. Figure 6 compares the 
dendrograms of the training data obtained with structural analysis and PCA. Figure 6(a) is the 
dendrogram constructed with structural analysis. The data were divided into three optimal clusters, 
which can be examined more closely using a scatter plot. 

On the other hand, Figure 6(b) was constructed with PCA using six principal components, which 
divided the data into two clusters. The difference arises because each method selects different 
variables. Structural analysis focused closely on the superheater variables, which are considered 
the leading cause of the leakage. PCA, in contrast, took in broader aspects of the boiler, which 
generalized its observation and resulted in a different diagnosis. 

The comparison of scatter plots of the training dataset for structural analysis and PCA can be 
seen in Figure 7(a) and Figure 7(b). Figure 7(a) shows the clustering result with structural 
analysis feature selection, while Figure 7(b) shows the clustering result with PCA feature 
selection. Feature selection using structural analysis made it simple to explain the character of 
the possible anomaly. From the structural analysis perspective, the anomaly that was followed by 
leakage failure showed unstable measurements (higher or lower than usual). The abnormality of the 
heat transfer process caused instability during the combustion. Slugging in the pipeline could 
cause the higher steam temperature. In comparison, the lower temperature could be caused by a 
defect in the pipeline material (such as corrosion or deteriorating material) that kept the 
pipeline absorbing heat without transferring it to the steam until the pipeline ruptured. Even 
though PCA covered more variables in the process, its result was not interpretable because the 
principal components did not have any physical meaning or explanation. 
Figure 6. Comparison of dendrogram in the training data using (a) structural analysis, and (b) PCA (6 principal 
components). The anomaly cluster is marked by black circle; the normal cluster is marked by grey or light grey 


3.3. False alarm occurrence and longevity of the prediction 

Both methods showed significant differences in false alarms and length of prediction, as shown 
in Figure 8. The blue dot in the figure represents the moment when the actual leakage was 
confirmed, and the red line indicates how long before the leakage the prediction began. An anomaly 
is indicated when the status value at that time equals 1, and normal operation when the status 
value is 0. 

Feature selection with PCA extended the longevity of the prediction to 25 hours. However, the 
alarm kept being triggered for the whole week because PCA detected many anomalies in other 
components even though they were not significantly related to the leakage event. On the other hand, 
feature selection with structural analysis could only predict the leakage 13 hours and 40 minutes 
in advance. However, its false alarm rate was low: only five false alarms were activated during the 
testing. The false alarm rate was low because structural analysis focused only on the superheater 
area, where most leakage failures occur. 
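
The two quantities compared above can be read from a 0/1 status series in a simple way. The sketch below is one possible reading, not the study's definition: it assumes the prediction lead is the unbroken run of alarms immediately before the failure sample, and that each earlier separate run of alarms counts as one false alarm. The function names and the short status series are invented for illustration.

```python
def lead_time_samples(status, failure_index):
    """Length of the unbroken run of alarms (status 1) ending just before failure_index."""
    count = 0
    i = failure_index - 1
    while i >= 0 and status[i] == 1:
        count += 1
        i -= 1
    return count


def false_alarm_episodes(status, failure_index):
    """Number of separate alarm runs that ended before the final pre-failure run."""
    final_run_start = failure_index - lead_time_samples(status, failure_index)
    episodes = 0
    previous = 0
    for value in status[:final_run_start]:
        if value == 1 and previous == 0:
            episodes += 1
        previous = value
    return episodes


def to_hours_minutes(n_samples, interval_minutes=10):
    """Convert a number of 10-minute samples to (hours, minutes)."""
    return divmod(n_samples * interval_minutes, 60)


# Invented status series; the failure is taken to occur right after the last sample.
toy_status = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1]
toy_lead = lead_time_samples(toy_status, failure_index=len(toy_status))
toy_false_alarms = false_alarm_episodes(toy_status, failure_index=len(toy_status))
```

At the 10-minute sampling interval, a lead of 13 hours 40 minutes corresponds to 82 samples and a lead of 25 hours to 150 samples, which `to_hours_minutes(82)` and `to_hours_minutes(150)` convert back to hours and minutes.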
Figure 7. Scatter plot of training data using (a) structural analysis (3 variables), and (b) PCA (6 principal 
components) 

Figure 8. The status value of anomaly detection on testing data using (a) structural analysis and (b) PCA 


4. CONCLUSION 
The implementation of different feature selection methods produced markedly different results. 
PCA provided an extended prediction period of 25 hours with no requirement to understand the 
process cycle in the boiler. Still, the alarm was triggered throughout the whole week, and the 
observation was not interpretable. PCA has these problems because it indirectly monitors more 
variables. As a result, it flags abnormality in different sections as the suspected anomaly even 
when it is not related to the failure. Conversely, structural analysis predicted the leakage 
13 hours 40 minutes before the occurrence, a much shorter lead than PCA. However, the clustering 
result could be interpreted: steam temperature variation in the superheater represented the anomaly 
in the heat transfer between the pipeline and steam, which led to the leakage. Because structural 
analysis focused only on the superheater, its false alarm rate was lower than that of PCA. Future 
research could develop a combination of PCA and structural analysis to predict leakage. 